Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning

Janz, David, Hron, Jiri, Mazur, Przemysław, Hofmann, Katja, Hernández-Lobato, José Miguel, Tschiatschek, Sebastian

Neural Information Processing Systems

Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy to implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it surpasses human performance on 38 of 49 games tested (achieving a median human normalised score of 2.09), and outperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.
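
The exploration scheme the abstract summarises is Thompson-sampling-style: draw one value function from a posterior and act greedily under that draw for a whole episode. Below is a minimal illustrative sketch of that loop; the toy Gaussian posterior and all names here are stand-ins for exposition, not the paper's implementation.

    import numpy as np

    # Minimal sketch of randomised-value-function exploration in the PSRL
    # style: one posterior draw per episode, greedy actions under the draw.
    # The Gaussian posterior below is a toy stand-in, not the SU model.

    class ToyQPosterior:
        def __init__(self, n_states, n_actions, rng):
            self.mean = np.zeros((n_states, n_actions))
            self.std = np.ones((n_states, n_actions))
            self.rng = rng

        def sample(self):
            # A single coherent draw, kept fixed for the whole episode.
            return self.mean + self.std * self.rng.standard_normal(self.mean.shape)

    rng = np.random.default_rng(0)
    posterior = ToyQPosterior(n_states=10, n_actions=2, rng=rng)
    for episode in range(3):
        q = posterior.sample()              # sample once per episode
        state = 0                           # hypothetical initial state
        action = int(np.argmax(q[state]))   # act greedily under the sample
        # ... step the environment, record transitions, update the posterior ...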



Reviews: Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning

Neural Information Processing Systems

This paper proposes using Bayesian linear regression to obtain a posterior over successor features as a way of representing uncertainty, from which the authors sample for exploration. I found the characterization of Randomised Policy Iteration strange, as it seems to apply to UBE but not to Bootstrapped DQN. With Bootstrapped DQN, each model in the ensemble is a value function pertaining to a different policy, so there is no single reference policy: the ensemble is trying to represent a distribution over optimal value functions, rather than value functions for a single reference policy.

Proposition 1: in the case of neural networks, and function approximation in general, it is very unlikely that we will get a factored distribution, so this claim does not seem applicable in general. In fact, there should typically be very high correlation between the Q-values of nearby states. Is this claim a direct response to UBE? Also, the analysis fixes the policy in order to consider the distribution over value functions, but this does not seem to be how posterior sampling is normally considered; rather, it is only how UBE considers it.
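
For concreteness, the construction this review describes follows from the standard successor-feature identities: with Q(s, a) = ψ(s, a)ᵀw and a Gaussian posterior over the reward weights w from Bayesian linear regression, the induced Q-posterior is jointly Gaussian with covariance ψΣψᵀ, whose off-diagonal entries are generally non-zero. A toy numerical check (all quantities illustrative):

    import numpy as np

    # Toy check that a Gaussian posterior over reward weights w, combined
    # with successor features psi, induces *correlated* Q-values:
    # Cov[Q(s,a), Q(s',a')] = psi(s,a) @ Sigma @ psi(s',a').

    rng = np.random.default_rng(0)
    d = 4                               # feature dimension (illustrative)
    psi = rng.random((3, d))            # successor features for 3 (s, a) pairs
    mu, Sigma = np.zeros(d), np.eye(d)  # stand-in posterior over reward weights

    q_mean = psi @ mu                   # posterior mean of each Q-value
    q_cov = psi @ Sigma @ psi.T         # full covariance; off-diagonals are
                                        # non-zero, so the Q-posterior does
                                        # not factorise across state-actions
    print(q_cov)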


Reviews: Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning

Neural Information Processing Systems

From the discussion, the reviewers appreciated the clarifications made in the rebuttal. They have indicated what they would like to see improved in a revised version, in particular a clearer presentation.



Successor Uncertainties: exploration and uncertainty in temporal difference learning

Janz, David, Hron, Jiri, Hernández-Lobato, José Miguel, Hofmann, Katja, Tschiatschek, Sebastian

arXiv.org Machine Learning

We consider the problem of balancing exploration and exploitation in sequential decision making problems. To explore efficiently, it is vital to consider the uncertainty over all consequences of a decision, and not just those that follow immediately; the uncertainties involved need to be propagated according to the dynamics of the problem. To this end, we develop Successor Uncertainties, a probabilistic model for the state-action value function of a Markov Decision Process that propagates uncertainties in a coherent and scalable way. We relate our approach to other classical and contemporary methods for exploration and present an empirical analysis.
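
The propagation the abstract refers to can be written in standard successor-feature notation (our reconstruction from the usual definitions, not text from the paper). With rewards modelled as r(s, a) = φ(s, a)ᵀw:

    % Successor features: discounted expected future features under \pi.
    \psi^\pi(s,a) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right],
    \qquad
    Q^\pi(s,a) \;=\; \psi^\pi(s,a)^\top w.

    % The successor features satisfy a Bellman recursion,
    \psi^\pi(s,a) \;=\; \phi(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}\!\left[\psi^\pi(s', a')\right],
    % so posterior uncertainty over w is carried through the dynamics
    % to the Q-values at every state-action pair.

A Gaussian posterior over w then yields a jointly Gaussian Q-posterior that is consistent with this recursion, which is the sense in which the uncertainties are propagated coherently.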